Abstract 

Motivated by questions in lossy data compression and by theoretical considerations, we examine the problem 
of estimating the rate-distortion function of an unknown (not necessarily discrete-valued) source from empirical 
data. Our focus is the behavior of the so-called "plug-in" estimator, which is simply the rate-distortion function of 
the empirical distribution of the observed data. Sufficient conditions are given for its consistency, and examples are 
provided to demonstrate that in certain cases it fails to converge to the true rate-distortion function. The analysis of its 
performance is complicated by the fact that the rate-distortion function is not continuous in the source distribution; 
the underlying mathematical problem is closely related to the classical problem of establishing the consistency 
of maximum likelihood estimators. General consistency results are given for the plug-in estimator applied to a 
broad class of sources, including all stationary and ergodic ones. A more general class of estimation problems is 
also considered, arising in the context of lossy data compression when the allowed class of coding distributions 
is restricted; analogous results are developed for the plug-in estimator in that case. Finally, consistency theorems 
are formulated for modified (e.g., penalized) versions of the plug-in, and for estimating the optimal reproduction 
distribution. 

Index Terms 

Rate-distortion function, entropy, estimation, consistency, maximum likelihood, plug-in estimator 

I. Introduction 

Suppose a data string $x_1^n := (x_1, x_2, \ldots, x_n)$ is generated by a stationary memoryless source $(X_n ; n \geq 1)$ 
with unknown marginal distribution $P$ on a discrete alphabet $A$. In many theoretical and practical 
problems arising in a wide variety of scientific contexts, it is desirable, and often important, to obtain 
accurate estimates of the entropy $H(P)$ of the source, based on the observed data $x_1^n$; see, for example, 
[35] [26] [30] [29] [32] [31] [8] and the references therein. Perhaps the simplest method is via the so-called 
plug-in estimator, where the entropy of $P$ is estimated by $H(P_{x_1^n})$, namely, the entropy of the empirical 
distribution $P_{x_1^n}$ of $x_1^n$. The plug-in estimator satisfies the basic statistical requirement of consistency, that 
is, $H(P_{x_1^n}) \to H(P)$ in probability as $n \to \infty$. In fact, it is strongly consistent; the convergence holds 
with probability one [2]. 
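
As a minimal illustration of the plug-in entropy estimator just described, the following Python sketch computes the entropy of the empirical distribution of a finite sample. The function name `plugin_entropy` is hypothetical, and the choice of nats (natural logarithm) is an assumption made to match the convention adopted later in the paper.

```python
import math
from collections import Counter

def plugin_entropy(samples):
    """Entropy (in nats) of the empirical distribution of `samples`."""
    n = len(samples)
    counts = Counter(samples)  # symbol -> number of occurrences
    # Empirical probability of each observed symbol is count / n.
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

For a sample in which two symbols occur equally often, the estimate equals log 2 nats; for a constant sample it is zero.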

A natural generalization is the problem of estimating the rate-distortion function R(P, D) of a (not 
necessarily discrete-valued) source. Motivation for this comes in part from lossy data compression, where 
we may need an estimate of how well a given data set could potentially be compressed, cf. [10], and also 
from cases where we want to quantify the "information content" of a particular signal, but the data under 
examination take values in a continuous (or more general) alphabet, cf. [27]. 

The rate-distortion function estimation question appears to have received little attention in the literature. 
Here we present some basic results for this problem. First, we consider the simple plug-in estimator 
$R(P_{x_1^n}, D)$, and determine conditions under which it is strongly consistent, that is, it converges to $R(P, D)$ 
with probability 1, as $n \to \infty$. We call this the nonparametric estimation problem, for reasons that will 
become clear below. 

A shorter version of this paper is to appear in IEEE Transactions on Information Theory. 
At first glance, consistency may seem to be a mere continuity issue: Since the empirical distribution 
$P_{x_1^n}$ converges, with probability 1, to the true distribution $P$ as $n \to \infty$, a natural approach to proving 
that $R(P_{x_1^n}, D)$ also converges to $R(P, D)$ would be to try to establish some sort of continuity property 
for $R(P, D)$ as a function of $P$. But, as we shall see, $R(P_{x_1^n}, D)$ turns out to be consistent under rather 
mild assumptions, which are in fact too mild to ensure continuity in any of the usual topologies; see 
Section III-E for explicit counterexamples. This also explains our choice of the empirical distribution $P_{x_1^n}$ 
as an estimate for $P$: If $R(P, D)$ were continuous in $P$, then any consistent estimator $\hat{P}_n$ of $P$ could be used 
to make $R(\hat{P}_n, D)$ a consistent estimator for $R(P, D)$. Some of the subtleties in establishing regularity 
properties of the rate-distortion function $R(P, D)$ as a function of $P$ are illustrated in [11] [1]. 

Another advantage of a plug-in estimator is that $P_{x_1^n}$ has finite support, regardless of the source alphabet. 
This makes it possible (when the reproduction alphabet is also finite) to actually compute $R(P_{x_1^n}, D)$ by 
approximation techniques such as the Blahut-Arimoto algorithm [7] [3] [12]. When the reproduction 
alphabet is continuous, the Blahut-Arimoto algorithm can still be used after discretizing the reproduction 
alphabet; the discretization can, in part, be justified by the observation that it can be viewed as an instance 
of the parametric estimation problem described below. Other possibilities for continuous reproduction 
alphabets are explored in [33] [5]. 
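
To make the computational remark concrete, here is a hedged Python sketch of the standard Blahut-Arimoto iteration applied to a finitely supported source distribution, such as an empirical distribution, with a finite reproduction alphabet. The function names, the fixed iteration count, and the parameterization by a slope parameter `beta` are assumptions of this sketch and are not taken from the paper; each call returns one point $(D, R)$ on the rate-distortion curve, with $R$ in nats.

```python
import numpy as np

def empirical_pmf(samples):
    """Support points and probabilities of the empirical distribution."""
    values, counts = np.unique(np.asarray(samples), return_counts=True)
    return values, counts / counts.sum()

def blahut_arimoto(p, dist, beta, iters=500):
    """One point (D, R) on the rate-distortion curve of the pmf `p`.

    p    : shape (nx,) source probabilities (e.g. an empirical pmf)
    dist : shape (nx, ny) matrix of distortions rho(x, y)
    beta : positive slope parameter; larger beta gives smaller D
    Returns the distortion D and the rate R (in nats).
    """
    p = np.asarray(p, dtype=float)
    dist = np.asarray(dist, dtype=float)
    kernel = np.exp(-beta * dist)
    q = np.full(dist.shape[1], 1.0 / dist.shape[1])  # initial reproduction pmf
    for _ in range(iters):
        w = kernel * q                     # unnormalized conditional W(y | x)
        w /= w.sum(axis=1, keepdims=True)  # normalize each row
        q = p @ w                          # updated reproduction marginal
    w = kernel * q
    w /= w.sum(axis=1, keepdims=True)
    q_out = p @ w                          # output marginal of the final W
    joint = p[:, None] * w
    D = float((joint * dist).sum())
    mask = joint > 0                       # skip zero-probability pairs
    ratio = w / np.where(q_out > 0, q_out, 1.0)
    R = float((joint[mask] * np.log(ratio[mask])).sum())
    return D, R
```

Because the iteration is indexed by `beta` rather than by the distortion level, obtaining the plug-in estimate at a prescribed $D$ would require sweeping `beta` and interpolating, which this sketch does not attempt.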

The consistency problem can be framed in the following more general setting. As has been observed 
by several authors recently, the rate-distortion function of a memoryless source admits the decomposition, 

$$R(P,D) = \inf_{Q} R(P,Q,D), \tag{1}$$

where the infimum is over all probability distributions Q on the reproduction alphabet, and R(P, Q, D) is 
the rate achieved by memoryless random codebooks with distribution Q used to compress the source data 
to within distortion D; see, e.g., [36] [15]. Therefore, R(P, D) is the best rate that can be achieved by this 
family of codebooks. But in the case where we only have a restricted family of compression algorithms 
available, indexed, say, by a family of probability distributions $\{Q_\theta ; \theta \in \Theta\}$ on the reproduction alphabet, 
then the best achievable rate is: 

$$R^\Theta(P,D) := \inf_{\theta \in \Theta} R(P,Q_\theta,D). \tag{2}$$

We also consider the parametric estimation problem, namely, that of establishing the strong consistency 
of the corresponding plug-in estimator $R^\Theta(P_{x_1^n}, D)$ as an estimator for $R^\Theta(P, D)$. It is important to 
note that, when $\Theta$ indexes the set of all probability distributions on the reproduction alphabet, then the 
parametric and nonparametric problems are identical, and this allows us to treat both problems in a 
common framework. 

Our two main results, Theorems 4 and 5 in the following section, give regularity conditions for both the 
parametric and nonparametric estimation problems under which the plug-in estimator is strongly consistent. 
It is shown that consistency holds in great generality for all distortion values $D$ such that $R^\Theta(P, D)$ is 
continuous from the left. An example illustrating that consistency may actually fail at points where 
left-continuity does not hold is given in Section III-D. In particular, for the nonparametric estimation 
problem we obtain the following three simple corollaries, which cover many practical cases. 

Corollary 1: If the reproduction alphabet is finite, then for any source distribution $P$, $R(P_{x_1^n}, D)$ is 
strongly consistent for $R(P, D)$ at all distortion levels $D \geq 0$ except perhaps at the single value where 
$R(P, D)$ transitions from being finite to being infinite. 

Corollary 2: If the source and reproduction alphabets are both equal to $\mathbb{R}^d$ and the distortion measure 
is squared-error, then for any source distribution $P$ and any distortion level $D \geq 0$, $R(P_{x_1^n}, D)$ is strongly 
consistent for $R(P, D)$. 

Corollary 3: Assume that the reproduction alphabet is a compact, separable metric space, and that 
the distortion measure $\rho(x, \cdot)$ is continuous for each $x \in A$. Then (under mild additional measurability 
assumptions), for any source distribution $P$, $R(P_{x_1^n}, D)$ is strongly consistent for $R(P, D)$ at all distortion 
levels $D \geq 0$ except perhaps at the single value where $R(P, D)$ transitions from being finite to being 
infinite. 

Corollaries 1 and 3 are special cases of Corollary 6 in Section II. Corollary 2 is established in 
Section III, which contains many other explicit examples illustrating the consistency results and cases 
where consistency may fail. Section V contains the proofs of all the main results in this paper. 

We also consider extensions of these results in two directions. In Section IV-A we examine the problem 
of estimating the optimal reproduction distribution - namely, the distribution that actually achieves the 
infimum in equations (1) and (2) - from empirical data. Consistency results are given, under conditions 
identical to those required for the consistency of the plug-in estimator. Finally, in Section IV-B we show 
that consistency holds for a more general class of estimators, which arise as modifications of the plug- 
in. These include, in particular, penalized versions of the plug-in, analogous to the standard penalized 
maximum likelihood estimators used in statistics. 

The analysis of the plug-in estimator presents some unexpected technical difficulties. One way to explain 
the source of these difficulties is by noting that there is a very close analogy, at least on the level of the 
mathematics, with the problem of maximum likelihood estimation [see also Section IV-B for another 
instance of this connection]. Beyond the superficial observation that they are both extremization problems 
over a space of probability distributions, a more accurate, albeit heuristic, illustration can be given as 
follows: Suppose we have a memoryless source with distribution P on some discrete alphabet, take the 
reproduction alphabet to be the same as the source alphabet, and look at the extreme case where no 
distortion is allowed. Then the plug-in estimator of the rate-distortion function (which now is simply the 
entropy) can be expressed as a trivial minimization over all possible coding distributions, i.e., 

$$H(P_{x_1^n}) = \min_{Q} \left[ H(P_{x_1^n}) + H(P_{x_1^n} \| Q) \right] = -\frac{1}{n} \max_{Q} \left[ \log Q^n(x_1^n) \right],$$

where $H(P \| Q)$ denotes the relative entropy, and $Q^n$ is the $n$-fold product distribution of $n$ independent 
random variables each distributed according to $Q$. Therefore, the computation of the plug-in estimate 
$H(P_{x_1^n})$ is exactly equivalent to the computation of the maximum likelihood estimate (MLE) of $P$ over a 
class of distributions $Q$. Alternatively, in Csiszár's terminology, the minimization of the relative entropy 
above corresponds to the so-called "reversed I-projection" of $P_{x_1^n}$ onto the set of feasible distributions $Q$, 
which in this case consists of all distributions on the reproduction alphabet; see, e.g., [16] [13]. Formally, 
this projection is exactly the same as the computation of the MLE of $P$ based on $x_1^n$. 
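
The identity behind this equivalence is that, for any i.i.d. model $Q$, the normalized negative log-likelihood satisfies $-\frac{1}{n} \log Q^n(x_1^n) = H(P_{x_1^n}) + H(P_{x_1^n} \| Q)$, which is minimized exactly when $Q$ equals the empirical distribution. The short Python sketch below checks this numerically on a toy sample; the helper names are hypothetical and the models are represented as plain dictionaries for illustration only.

```python
import math
from collections import Counter

def empirical(samples):
    """Empirical distribution of `samples` as a dict symbol -> probability."""
    n = len(samples)
    return {a: c / n for a, c in Counter(samples).items()}

def avg_neg_log_likelihood(samples, q):
    """-(1/n) log Q^n(x_1^n) for an i.i.d. model q (dict symbol -> prob)."""
    return -sum(math.log(q[x]) for x in samples) / len(samples)

def plugin_entropy_via_mle(samples):
    """Plug-in entropy, obtained by evaluating the likelihood at the MLE."""
    return avg_neg_log_likelihood(samples, empirical(samples))
```

Evaluating `avg_neg_log_likelihood` at any other model `q` with full support on the observed symbols gives a value at least as large as `plugin_entropy_via_mle(samples)`, the gap being the relative entropy between the empirical distribution and `q`.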

In the general case of nonzero distortion $D > 0$, the plug-in estimator can similarly be expressed as 
$R(P_{x_1^n}, D) = \min_Q R(P_{x_1^n}, Q, D)$, cf. (1) above. This (now highly nontrivial) minimization is mathematically 
very closely related to the problem of computing an I-projection as before. The tools we employ 
to analyze this minimization are based on the technique of epigraphical convergence [34] [4] (this is 
particularly clear in the proof of our main result, the lower bound in Theorem 5), and it is no coincidence 
that these same tools have also provided one of the most successful approaches to proving the consistency 
of MLEs. By the same token, this connection also explains why the consistency of the plug-in estimator 
involves subtleties similar to those cases where MLEs fail to be consistent [28]. 

In the way of motivation, we also mention that the asymptotic behavior of the plug-in estimator - and 
the technical intricacies involved in its analysis - also turn out to be important in extending some of 
Rissanen's celebrated ideas related to the Minimum Description Length (MDL) principle to the context 
of lossy data compression; this direction will be explored in subsequent work. 

Throughout the paper we work with stationary and ergodic sources instead of memoryless sources, 
though we are still only interested in estimating the first-order rate-distortion function. One reason for 
this is that the full rate-distortion function can be estimated by looking at the process in sliding blocks 
of length m and then estimating the "marginal" rate-distortion function of these blocks for large m; see 
Section III-F. Another reason for allowing dependence in the data comes from simulation: For example, 
suppose we were interested in estimating the rate-distortion function of a distribution $P$ that we cannot 
compute explicitly (as is the case for perhaps the majority of models used in image processing), but 
for which we have a Markov chain Monte Carlo (MCMC) sampling algorithm. The data generated by 
such an algorithm are not memoryless, yet we care only about the rate-distortion function of the marginal 
distribution. In Section IV-C we comment further on this issue, and also give consistency results for data 
produced by sources that may not be stationary. 



II. Main Results 

We begin with some notation and definitions that will remain in effect throughout the paper. 

Suppose the random source $(X_n ; n \geq 1)$ taking values in the source alphabet $A$ is to be compressed 
in the reproduction alphabet $\hat{A}$, with respect to the single-letter distortion measures $(\rho_n)$ arising from an 
arbitrary distortion function $\rho : A \times \hat{A} \to [0, \infty)$. We assume that $A$ and $\hat{A}$ are equipped with the $\sigma$-algebras 
$\mathcal{A}$ and $\hat{\mathcal{A}}$, respectively, that $(A, \mathcal{A})$ and $(\hat{A}, \hat{\mathcal{A}})$ are Borel spaces, and that $\rho$ is $(\mathcal{A} \times \hat{\mathcal{A}})$-measurable.$^1$ 
Suppose the source is stationary, and let $P$ denote its marginal distribution on $A$. Then the (first-order) 
rate-distortion function $R_1(P, D)$ with respect to the distortion measure $\rho$ is defined as, 

$$R_1(P, D) := \inf I(U;V), \qquad D \geq 0,$$

where the infimum is over all $A \times \hat{A}$-valued random variables $(U, V)$ with joint distribution $W$ belonging 
to the set 

$$W(P, D) := \{ W : W_A = P, \ E_W[\rho(U, V)] \leq D \},$$

and where $W_A$ denotes the marginal distribution of $W$ on $A$, and similarly for $W_{\hat{A}}$; the infimum is taken 
to be $+\infty$ when $W(P, D)$ is empty. As usual, the mutual information $I(U;V)$ between two random 
variables $U, V$ with joint distribution $W$, is defined as the relative entropy between $W$ and the product of 
its two marginals, $W_A \times W_{\hat{A}}$. Here and throughout the paper, all familiar information-theoretic quantities 
are expressed in nats, and log denotes the natural logarithm. In particular, for any two probability measures 
$\mu, \nu$ on the same space, the relative entropy $H(\mu \| \nu)$ is defined as $E_\mu[\log \frac{d\mu}{d\nu}]$ whenever the density $d\mu/d\nu$ 
exists, and it is taken to be $+\infty$ otherwise. 

We write $D_c(P)$ for the set of distortion values $D \geq 0$ for which $R_1(P, D)$ is continuous from the left, 
i.e., 

$$D_c(P) := \{ D \geq 0 : R_1(P, D) = \lim_{\lambda \uparrow 1} R_1(P, \lambda D) \}.$$

By convention, this set always includes $0$ and any value of $D$ for which $R_1(P, D) = \infty$. But since 
$R_1(P, D)$ is nonincreasing and convex in $D$ [9] [11], $D_c(P)$ actually includes all $D \geq 0$ with the only 
possible exception of the single value of $D$ where $R_1(P, D)$ transitions from being finite to being infinite. 
Conditions guaranteeing that $D_c(P)$ is indeed all of $[0, \infty)$ can be found in [11]. 



A. Estimation Problems and Plug-in Estimators 

Given a finite-length data string $x_1^n := (x_1, x_2, \ldots, x_n)$ produced by a stationary source $(X_n)$ as above 
with marginal distribution $P$, the plug-in estimator of the first-order rate-distortion function $R_1(P, D)$ is 
$R_1(P_{x_1^n}, D)$, where $P_{x_1^n}$ is the empirical distribution induced by the sample $x_1^n$ on $A$, namely, 

$$P_{x_1^n}(C) := \frac{1}{n} \sum_{k=1}^{n} 1\{x_k \in C\}, \qquad x_1^n \in A^n, \ C \in \mathcal{A},$$

and where $1$ is the indicator function. Our first goal is to obtain conditions under which this estimator is 
strongly consistent. We call this the nonparametric estimation problem. 

1 Borel spaces include the Euclidean spaces $\mathbb{R}^d$ as well as all Polish spaces, and they allow us to avoid certain measure-theoretic pathologies 
while working with random sequences and conditional distributions [25]. Henceforth, all $\sigma$-algebras and the various product $\sigma$-algebras derived 
from them are understood from the context. We do not complete any of the $\sigma$-algebras, but we say that an event $C$ holds with probability 
1 (w.p.1) if $C$ contains a measurable subset $C'$ that has probability 1. 
We also consider the more general class of estimation problems mentioned in the Introduction. Suppose 
for a moment that our goal is to compress data produced by a memoryless source $(X_n)$ with distribution $P$ 
on $A$, and suppose also that we are restricted to using memoryless random codebooks with distributions 
$Q$ belonging to some parametric family $\{Q_\theta : \theta \in \Theta\}$ where $\Theta$ indexes a subset of all probability 
distributions on $\hat{A}$. Using a random codebook with distribution $Q$ to compress the data to within distortion 
$D$ yields (asymptotically) a rate of $R_1(P, Q, D)$ nats/symbol, where the rate-function $R_1(P, Q, D)$ is given 
by, 

$$R_1(P, Q, D) = \inf_{W \in W(P, D)} H(W \| P \times Q).$$

See [36] [15] for details. From this it is immediate that the rate-distortion function of the source admits the 
decomposition given in (1). Having restricted attention to the class of codebook distributions $\{Q_\theta ; \theta \in \Theta\}$, 
the best possible compression rate is: 

$$R_1^\Theta(P, D) := \inf_{\theta \in \Theta} R_1(P, Q_\theta, D) \ \text{nats/symbol}. \tag{3}$$

When $\Theta$ indexes certain nice families, say Gaussian, the infimum $R_1^\Theta(P, D)$ can be analytically derived 
or easily computed, often for any distribution $P$, including an empirical distribution. 

Thus motivated, we now formally define the parametric estimation problem. Suppose $(X_n)$ is a 
stationary source as above, and let $\{Q_\theta : \theta \in \Theta\}$ be a family of probability distributions on the 
reproduction alphabet $\hat{A}$ parameterized by an arbitrary parameter space $\Theta$. The plug-in estimator for 
$R_1^\Theta(P, D)$ is $R_1^\Theta(P_{x_1^n}, D)$, and we seek conditions for its strong consistency. 

Note that $R_1^\Theta(P, D) = R_1(P, D)$ when $\{Q_\theta : \theta \in \Theta\}$ includes all probability distributions on $\hat{A}$, 
or if it simply includes the optimal reproduction distribution achieving the infimum in (1). Otherwise, 
$R_1^\Theta(P, D)$ may be strictly larger than $R_1(P, D)$. Therefore, the nonparametric problem is a special case 
of the parametric one, and we can consider the two situations in a common framework. 

In the parametric scenario we write, 

$$D_c^\Theta(P) := \{ D \geq 0 : R_1^\Theta(P, D) = \lim_{\lambda \uparrow 1} R_1^\Theta(P, \lambda D) \}.$$

Unlike $D_c(P)$, $D_c^\Theta(P)$ can exclude more than a single point. 



B. Consistency 

We investigate conditions under which the plug-in estimator $R_1^\Theta(P_{x_1^n}, D)$ is strongly consistent, i.e.,$^2$ 

$$R_1^\Theta(P_{x_1^n}, D) \xrightarrow{\text{w.p.1}} R_1^\Theta(P, D). \tag{4}$$

Of course in the special case where $\Theta$ indexes all probability distributions on $\hat{A}$, this reduces to the 
nonparametric problem, and (4) becomes $R_1(P_{x_1^n}, D) \xrightarrow{\text{w.p.1}} R_1(P, D)$. We separately treat the upper and 
lower bounds that combine to give (4). 

The upper bound does not require any further regularity assumptions, although there can be certain 
pathological values of D for which it is not valid. In the nonparametric situation, the only potential 
problem point is the single value of $D$ where $R_1(P, D)$ transitions from finite to infinite. 

Theorem 4: If the source $(X_n)$ is stationary and ergodic with $X_1 \sim P$, then 

$$\limsup_{n \to \infty} R_1^\Theta(P_{x_1^n}, D) \overset{\text{w.p.1}}{\leq} R_1^\Theta(P, D)$$

for all $D \in D_c^\Theta(P)$. 

2 Throughout the paper we do not require limits to be finite valued, but say that $\lim_n a_n = \infty$ if $a_n$ diverges to $\infty$ (and similarly for 
$-\infty$). 
As illustrated by a simple counterexample in Section III-D, the requirement that $D \in D_c^\Theta(P)$ cannot be 
relaxed completely. The proof of the theorem, given in Section V, is a combination of the decomposition 
in (3) and the fact that $R_1(P_{x_1^n}, Q, D) \to R_1(P, Q, D)$ quite generally. Actually, from the proof we also 
obtain an upper bound on the lim inf, 

$$\liminf_{n \to \infty} R_1^\Theta(P_{x_1^n}, D) \overset{\text{w.p.1}}{\leq} R_1^\Theta(P, D) \quad \text{for all } D \geq 0, \tag{5}$$

which provides some information even for those values of $D$ where the upper bound in Theorem 4 may 
fail. 

For the corresponding lower bound in (4), some mild additional assumptions are needed. We will always 
assume that $\Theta$ is a metric space, and also that the following two conditions are satisfied: 
A1. The map $\theta \mapsto E_\theta[e^{\lambda \rho(x, Y)}]$ is continuous for each $x \in A$ and $\lambda \leq 0$, where $E_\theta$ denotes expectation 
w.r.t. $Q_\theta$. 

A2. For each $D \geq 0$, there exists a (possibly random) sequence $(\theta_n)$ with 

$$\liminf_{n \to \infty} R_1(P_{x_1^n}, Q_{\theta_n}, D) \overset{\text{w.p.1}}{\leq} \liminf_{n \to \infty} R_1^\Theta(P_{x_1^n}, D), \tag{6}$$

and such that $(\theta_n)$ is relatively compact with probability 1. 
Theorem 5: If $\Theta$ is separable, A1 and A2 hold, and $(X_n)$ is stationary and ergodic with $X_1 \sim P$, then 

$$\liminf_{n \to \infty} R_1^\Theta(P_{x_1^n}, D) \overset{\text{w.p.1}}{\geq} R_1^\Theta(P, D)$$

for all $D \geq 0$. 

Although A1 and A2 may seem quite involved, they are fairly easy to verify in specific examples: For 
A1, we have the following sufficient conditions; as we prove in Section V, either one implies A1. 
P1. Whenever $\theta_n \to \theta$, we also have that $Q_{\theta_n} \to Q_\theta$ setwise.$^3$ 

N1. $(\hat{A}, \hat{\mathcal{A}})$ is a metric space with its Borel $\sigma$-algebra, $\rho(x, \cdot)$ is continuous for each $x \in A$ and $\theta_n \to \theta$ 
implies that $Q_{\theta_n} \to Q_\theta$ weakly.$^4$ 
For A2, we first note that a sequence $(\theta_n)$ satisfying (6) always exists and that the inequality in (6) must 
always be an equality. The important requirement in A2 is that $(\theta_n)$ be relatively compact. In particular, 
A2 is trivially true if $\Theta$ is compact. More generally, the following two conditions make it easier to verify 
A2 in particular examples. In Section V we prove that either one implies A2 as long as the source is 
stationary and ergodic with marginal distribution $P$. For any subset $K$ of the source alphabet $A$, we write 
$B(K, M)$ for the subset of $\hat{A}$ which is the union of all the distortion balls of radius $M \geq 0$ centered at 
points of $K$. Formally, 

$$B(K, M) := \bigcup_{x \in K} \{ y : \rho(x, y) \leq M \}, \qquad K \subset A, \ M \geq 0.$$

P2. For each $D \geq 0$, there exists a $\Delta > 0$ and a $K \in \mathcal{A}$ such that $P(K) > D/(D + \Delta)$ and $\{\theta : 
Q_\theta(B(K, D + \Delta)) \geq \epsilon\}$ is relatively compact for each $\epsilon > 0$. 

N2. $(\hat{A}, \hat{\mathcal{A}})$ is a metric space with its Borel $\sigma$-algebra, $\Theta$ is the set of all probability distributions on $\hat{A}$ 
with a metric that metrizes weak convergence of probability measures, and for each $\epsilon > 0$ and each 
$M > 0$ there exists a $K \in \mathcal{A}$ such that $P(K) > 1 - \epsilon$ and $B(K, M)$ is relatively compact.$^5$ 
In Section III we describe concrete situations where these assumptions are valid. 

3 We say that $Q_m \to Q$ setwise if $E_{Q_m}(f) \to E_Q(f)$ for all bounded, measurable functions $f$, or equivalently, if $Q_m(C) \to Q(C)$ for 
all measurable sets $C$. 

4 We say that $Q_m \to Q$ weakly if $E_{Q_m}(f) \to E_Q(f)$ for all bounded, continuous functions $f$, or equivalently, if $Q_m(C) \to Q(C)$ for 
all measurable sets $C$ with $Q(\partial C) = 0$. 

5 $\Theta$ can always be metrized in this way, and so that $\Theta$ will be separable (compact) if $\hat{A}$ is separable (compact) [6]. 
The proof of Theorem 5 has the following main ingredients. The separability of $\Theta$ and the continuity 
in A1 are used to ensure measurability and, in particular, for controlling exceptional sets. A1 is a local 
assumption that ensures $\inf_{\theta \in U} R_1(P_{x_1^n}, Q_\theta, D)$ is well behaved in small neighborhoods $U$. A2 is a global 
assumption that ensures the final analysis can be restricted to a small neighborhood. 

Combining Theorems 4 and 5 gives conditions under which $R_1^\Theta(P_{x_1^n}, D) \xrightarrow{\text{w.p.1}} R_1^\Theta(P, D)$. In the 
nonparametric situation we have the following Corollary, which is a generalization of Corollary 3 in the 
Introduction; it follows immediately from the last two theorems. 

Corollary 6: Suppose $(\hat{A}, \hat{\mathcal{A}})$ is a compact, separable metric space with its Borel $\sigma$-algebra and $\rho(x, \cdot)$ is 
continuous for each $x \in A$. If $(X_n)$ is stationary and ergodic with $X_1 \sim P$, then $R_1(P_{x_1^n}, D) \xrightarrow{\text{w.p.1}} R_1(P, D)$ 
for all $D \in D_c(P)$. Furthermore, the compactness condition can be relaxed as in N2. 

III. Examples 

In all of the examples we assume that the source $(X_n)$ is stationary and ergodic with $X_1 \sim P$. 

A. Nonparametric Consistency: Discrete Alphabets 

Let $A$ and $\hat{A}$ be at most countable and let $\rho$ be unbounded in the sense that for each fixed $x \in A$ and each 
fixed $M > 0$ there are only finitely many $y \in \hat{A}$ with $\rho(x, y) < M$. N1 and N2 are clearly satisfied in the 
nonparametric setting where $\Theta$ is the set of all probability distributions on $\hat{A}$, so $R_1(P_{x_1^n}, D) \xrightarrow{\text{w.p.1}} R_1(P, D)$ 
for all $D$ except perhaps at the single value of $D$ where $R_1(P, D)$ transitions from finite to infinite. If, 
in addition, for each $x$ there exists a $y$ with $\rho(x, y) = 0$, then $D_c(P) = [0, \infty)$ regardless of $P$ [11], and 
the plug-in estimator is strongly consistent for all $P$ and all $D$. 

This example also yields a different proof of the general consistency result mentioned in the Introduction, 
for the plug-in estimate of the entropy of a discrete-valued source: If we map $A = \hat{A}$ into the integers, 
let $\rho(x, y) = |x - y|$, and take $D = 0$, then we obtain the strong consistency of [2, Cor. 1]. 

B. Nonparametric Consistency: Continuous Alphabets 

Again in the nonparametric setting, let $A = \hat{A} = \mathbb{R}^d$ be finite dimensional Euclidean space, and 
let $\rho(x, y) := f(\|x - y\|)$ for some function $f$ of Euclidean distance where $f : [0, \infty) \to [0, \infty)$ is 
continuous and $f(t) \to \infty$ as $t \to \infty$. As in the previous example, N1 and N2 are clearly satisfied, so 
$R_1(P_{x_1^n}, D) \xrightarrow{\text{w.p.1}} R_1(P, D)$ for all $D$ except perhaps at the single value of $D$ where $R_1(P, D)$ transitions 
from finite to infinite. If furthermore $f(0) = 0$, then $D_c(P) = [0, \infty)$ regardless of $P$ [11] and the plug-in 
estimator is strongly consistent for all $P$ and all $D$. 

This example includes the important special case of squared-error distortion: In the nonparametric 
problem, the plug-in estimator is always strongly consistent under squared-error distortion over finite 
dimensional Euclidean space, as stated in Corollary 2. This example also generalizes as follows. The 
alphabets $A$ and $\hat{A}$ can be (perhaps different) subsets of $\mathbb{R}^d$, as long as $\hat{A}$ is closed. The use of Euclidean 
distance is not essential and we can take any $\rho(x, y) \geq f(\|x - y\|)$, so that $\rho$ is not required to be translation invariant, 
as long as $\rho$ is continuous over $\hat{A}$ for each fixed $x \in A$. This is enough for consistency except perhaps 
at a single value of $D$. To use the results in [11] to rule out any pathological values of $D$, that is, to 
show that $D_c(P) = [0, \infty)$, we also need $A$ to be closed, $\rho$ to be continuous over $A$ for each fixed $y$ and 
$\inf_y \rho(x, y) = 0$ for each $x$. 

C. Parametric Consistency for Gaussian Families 

Let $A = \hat{A} = \mathbb{R}$, let $\rho$ satisfy the assumptions of Example III-B, let $\Theta = \{(\mu, \sigma) \in \mathbb{R} \times [0, \infty)\}$ with 
the Euclidean metric, and for each $\theta = (\mu, \sigma)$ let $Q_\theta$ be Gaussian with mean $\mu$ and standard deviation 
$\sigma$ [the case $\sigma = 0$ corresponds to the point mass at $\mu$]. Conditions N1 and P2 are clearly satisfied, 
so $R_1^\Theta(P_{x_1^n}, D) \xrightarrow{\text{w.p.1}} R_1^\Theta(P, D)$ for all $D \in D_c^\Theta(P)$. In the special case where $\rho(x, y) = (x - y)^2$ is 
squared-error distortion, then it is not too difficult [15] to show that 

$$R_1^\Theta(P, D) = \max\left\{ 0, \ \frac{1}{2} \log \frac{\sigma_X^2}{D} \right\},$$

where $\sigma_X^2$ denotes the (possibly infinite) variance of $P$, so $D_c^\Theta(P) = [0, \infty)$ and the convergence holds 
for all $D$. Furthermore, if the source $P$ happens to also be Gaussian, then $R_1^\Theta(P, D) = R_1(P, D)$ and the 
plug-in estimator is also strongly consistent for the nonparametric problem. 
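
Under squared-error distortion, the closed form above makes the parametric plug-in estimate for the Gaussian family a simple function of the variance of the empirical distribution. The following Python sketch illustrates this for $D > 0$; the function name is hypothetical, and the restriction to strictly positive $D$ is a simplifying assumption of the sketch rather than a restriction in the text.

```python
import math

def gaussian_family_plugin_rate(samples, D):
    """Plug-in estimate max{0, 0.5 * log(var / D)} in nats, for D > 0.

    `var` is the variance of the empirical distribution of `samples`
    (normalized by n, not n - 1), under squared-error distortion.
    """
    if D <= 0:
        raise ValueError("this sketch only handles D > 0")
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    if var <= D:
        return 0.0
    return 0.5 * math.log(var / D)
```

For example, a sample with empirical variance 1 and $D = 0.25$ gives $\frac{1}{2} \log 4 = \log 2$ nats, while any $D$ at or above the empirical variance gives a rate of zero.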

D. Convergence Failure for $D \notin D_c(P)$ 

Let $A = \{0, 1\}$, $\hat{A} = \{0\}$, and $\rho(x, y) := |x - y|$. Since there is only one possible distribution on $\hat{A}$, it 
is easy to show that 

$$R_1(P', D) = \begin{cases} 0 & \text{if } P'(1) \leq D, \\ \infty & \text{otherwise,} \end{cases}$$

for any distribution $P'$ on $A$. If $P(1) > 0$, the only possible trouble point for consistency is $D = P(1)$, 
which is not in $D_c(P)$. It is easy to see that convergence (and therefore consistency) might fail at this 
point because $R_1(P_{x_1^n}, D)$ will jump back and forth between $0$ and $\infty$ as $P_{x_1^n}(1)$ jumps above and below 
$D = P(1)$. The law of the iterated logarithm implies that this failure to converge happens with probability 
1 when the source is memoryless. In general, when the source is stationary and ergodic, it turns out that 
convergence will fail with positive probability [24] [23] [20]. 
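
The oscillation in this example is easy to observe in simulation. The Python sketch below, offered only as an illustration, evaluates the plug-in estimate at $D = P(1)$ along a simulated memoryless binary sample path; the function names, the seeding, and the use of Python's `random` module are assumptions of the sketch.

```python
import math
import random

def rate_single_letter(p1, D):
    """R_1(P', D) for A = {0, 1}, reproduction alphabet {0}, rho = |x - y|."""
    return 0.0 if p1 <= D else math.inf

def plugin_path(p1, n, seed=0):
    """Plug-in estimates at D = p1 along one i.i.d. Bernoulli(p1) path."""
    rng = random.Random(seed)
    ones = 0
    path = []
    for k in range(1, n + 1):
        ones += rng.random() < p1  # True counts as 1
        # Empirical frequency of the symbol 1 after k observations.
        path.append(rate_single_letter(ones / k, p1))
    return path
```

Each entry of the returned path is either `0.0` or `math.inf`, depending on whether the empirical frequency of ones is currently at or below, or above, the true value.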

E. Consistency at a Point of Discontinuity in P 

This slightly modified example from Csiszár [11] illustrates that $R_1(\cdot, D)$ can be discontinuous at $P$ 
even though the plug-in estimator is consistent. Let $A = \hat{A} = \{1, 2, \ldots\}$, let $P'$ be any distribution on $A$ 
with infinite entropy and with $P'(x) > 0$ for all $x$, and let $\rho(x, y) := P'(x)^{-1} 1\{x \neq y\} + |x - y|$. Note 
that $R_1(P', D) = \infty$ for all $D$.$^6$ This is a special case of Example III-A, so the plug-in estimator is always 
strongly consistent regardless of $P$ and $D$. Nevertheless, $R_1(\cdot, D)$ is discontinuous everywhere it is finite. 

To see this, let the source $P$ be any distribution on $A$ with finite entropy $H(P)$. Note that $R_1(P, D) \leq 
R_1(P, 0) = H(P) < \infty$. Define the mixture distribution $P_\epsilon := (1 - \epsilon) P + \epsilon P'$. Then $P_\epsilon \to P$ in the 
topology of total variation$^7$ (and also any weaker topology) as $\epsilon \downarrow 0$, but $R_1(P_\epsilon, D) \not\to R_1(P, D)$ because 
$R_1(P_\epsilon, D) \geq \epsilon R_1(P', D/\epsilon) = \infty$ for all $\epsilon > 0$. See below for a proof of this last inequality.$^8$ 

The key property of $\rho$ in this example is that there exists a $P'$ with $R_1(P', D) = \infty$ for all $D$. If 
such a $P'$ exists, then $R_1(\cdot, D)$ will be discontinuous in the topology of total variation at any point $P$ 
where $R_1(P, D)$ is finite for exactly the same reason as above. Although this specific example is based on 
a rather pathological distortion measure, many unbounded distortion measures on continuous alphabets, 

6 $R_1(P', \cdot) \equiv \infty$, because any pair of random variables $(U, V)$ with $U \sim P'$ and $E[\rho(U, V)] < \infty$ has $I(U; V) = \infty$. To see this, first 
note that $E[\rho(U, V)] < \infty$ implies that $\alpha(x) := \text{Prob}\{V = x | U = x\} \to 1$ as $x \to \infty$; simply use the definition of $\rho$ and ignore the 
$|x - y|$ term. Computing the mutual information and using the log-sum inequality gives $I(U; V) \geq \kappa + \sum_x P'(x) \alpha(x) \log(\alpha(x)/Q(x))$, 
where $V \sim Q$ and where $\kappa$ is a finite constant that comes from all of the other terms in the definition of $I(U; V)$ combined together with 
the log-sum inequality. Since $\alpha(x) \to 1$ and since $\sum_x P'(x) \log(1/Q(x)) \geq H(P') = \infty$ for any probability distribution $Q$, we see that 
$I(U; V) = \infty$. We can ignore $\alpha(x)$ because the finiteness of the sum only depends on the behavior for large $x$, and for large enough $x$ we 
have $\alpha(x) > 1/2$, say. 

7 The topology of total variation is metrized by the distance $d(P, P') := \sup_C |P(C) - P'(C)|$. 

8 An interesting special case of this example (based on the fact that $\sum_{x \geq 2} [x \log^\alpha x]^{-1}$ converges if and only if $\alpha > 1$) is $P'(x - 1) \propto 
1/(x \log^{1.5} x)$ (infinite entropy) and $P(x - 1) \propto 1/(x \log^{2.5} x)$ (finite entropy), $x = 2, 3, \ldots$, because the relative entropies $H(P' \| P)$ and 
$H(P \| P')$ are both finite, so $H(P_\epsilon \| P) \to 0$ and $H(P \| P_\epsilon) \to 0$ as $\epsilon \downarrow 0$. (From the convexity of relative entropy.) This counterexample thus 
shows that even closeness in relative entropy between two distributions (which is stronger than closeness in total variation) is not enough to 
guarantee the closeness of the rate-distortion functions of the corresponding distributions. 
including squared-error distortion on $\mathbb{R}$, have such a $P'$ and are thus discontinuous in the topology of 
total variation.$^9$ 



F. Higher-Order Rate-Distortion Functions 

Suppose that we want to estimate the $m$th-order rate-distortion function of a stationary and ergodic source $(X_n)$ with $m$th-order marginal distribution $X_1^m \sim P_m$, namely,

$$R_m(P_m, D) := \frac{1}{m} \inf_{(U,V) \sim W \in W_m(P_m, D)} I(U;V),$$

where the infimum is over all $A^m \times \hat{A}^m$-valued random variables, with joint distribution $W$ in the set $W_m(P_m, D)$ of probability distributions on $A^m \times \hat{A}^m$ whose marginal distribution on $A^m$ equals $P_m$, and which have $E[\rho_m(U, V)] \le D$ for

$$\rho_m(x_1^m, y_1^m) := \frac{1}{m} \sum_{k=1}^{m} \rho(x_k, y_k).$$

All our results above immediately apply to this situation. We simply estimate the first-order rate-distortion function of the sliding-block process $(Z_n)$ defined by $Z_k := (X_k, \ldots, X_{k+m-1})$, with source alphabet $A^m$, reproduction alphabet $\hat{A}^m$ and distortion measure $\rho_m$, and then divide the estimate by $m$.
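
As a rough illustration of the sliding-block construction, the following Python sketch forms the overlapping blocks $Z_k$ from a data string, computes their empirical distribution, and evaluates the averaged block distortion. The function names (`sliding_blocks`, `empirical_distribution`, `rho_m`, `hamming`) and the toy data are assumptions made for this example only; they do not come from the paper.

```python
from collections import Counter

def sliding_blocks(x, m):
    """Return the overlapping blocks Z_k = (x_k, ..., x_{k+m-1})."""
    return [tuple(x[k:k + m]) for k in range(len(x) - m + 1)]

def empirical_distribution(samples):
    """Empirical distribution of a list of hashable samples, as a dict."""
    n = len(samples)
    return {s: c / n for s, c in Counter(samples).items()}

def rho_m(u, v, rho):
    """Per-symbol average of a single-letter distortion rho over two blocks."""
    return sum(rho(a, b) for a, b in zip(u, v)) / len(u)

def hamming(a, b):
    """Hamming distortion, used here only as an illustrative choice of rho."""
    return 0.0 if a == b else 1.0

x = [0, 1, 1, 0, 1, 0, 0, 1]           # toy data string (illustrative only)
blocks = sliding_blocks(x, 2)           # overlapping blocks of length m = 2
P_2_hat = empirical_distribution(blocks)
```

A plug-in estimate of the $m$th-order function would then be obtained by feeding `P_2_hat` (with block distortion `rho_m`) to whatever first-order routine is in use, and dividing the result by $m$.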



IV. Further Results 

A. Estimation of the Optimal Reproduction Distribution 

So far, we concentrated on conditions under which the plug-in estimator is consistent; these guarantee an (asymptotically) accurate estimate of the best compression rate $R_1^\Theta(P,D) = \inf_{\theta\in\Theta} R_1(P,Q_\theta,D)$ that can be achieved by codes restricted to some class of distributions $\{Q_\theta ; \theta \in \Theta\}$. Now suppose this infimum is achieved by some $\theta^*$, corresponding to the optimal reproduction distribution $Q_{\theta^*}$. Here we use a simple modification of the plug-in estimator in order to obtain estimates $\theta_n = \theta_n(x_1^n)$ for the optimal reproduction parameter $\theta^*$ based on the data $x_1^n$. Specifically, since we have conditions under which

$$\inf_{\theta\in\Theta} R_1(P_{x_1^n}, Q_\theta, D) \approx \inf_{\theta\in\Theta} R_1(P, Q_\theta, D), \tag{7}$$

we naturally consider the sequence of estimators which achieve the infima on the left-hand side of (7) for each $n \ge 1$; that is, we simply replace the inf by an arg inf. Since these arg-infima may not exist or may not be unique, we actually consider any sequence of approximate minimizers $(\theta_n)$ that have $R_1(P_{x_1^n}, Q_{\theta_n}, D) \approx R_1^\Theta(P_{x_1^n}, D)$ in the sense that (9) below holds. Similarly, minimizers $\theta^*$ of the right-hand side of (7) may not exist or be unique, either. We thus consider the (possibly empty) set $\Theta^*$ containing all the minimizers of $R_1(P, Q_\theta, D)$ and address the problem of whether the estimators $\theta_n$ converge to $\Theta^*$, meaning that $\theta_n$ is eventually in any neighborhood of $\Theta^*$.

Our proofs are in part based on a recent result from [24] [23]. 

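Theorem 7: [24] [23] If the source $(X_n)$ is stationary and ergodic with $X_1 \sim P$, then

$$\liminf_{n\to\infty} R_1(P_{X_1^n}, Q, D) \overset{\text{w.p.1}}{=} R_1(P, Q, D)$$

for all $D \ge 0$ and

$$\lim_{n\to\infty} R_1(P_{X_1^n}, Q, D) \overset{\text{w.p.1}}{=} R_1(P, Q, D) \tag{8}$$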

'For squared-error distortion, let P' be any distribution over discrete points {xi,X2, . ..} C R where > x^^i + 2 1//p and where 
$H(P') = \infty$. This is essentially the same as Csiszár's example above because any pair of random variables $(U, V)$ with $E[\rho(U, V)] < \infty$ must have $\mathrm{Prob}\{V \text{ closer to } x_k \text{ than any other } x_j \mid U = x_k\} \to 1$ as $k \to \infty$.
for all $D$ in the set

$$D_c(P, Q) := \left\{ D \ge 0 : R_1(P, Q, D) = \lim_{\lambda \uparrow 1} R_1(P, Q, \lambda D) \right\}.$$

Similar to $D_c(P)$, $D_c(P,Q)$ always contains $0$ and any point where $R_1(P,Q,D) = \infty$. Since the function $R_1(P, Q, D)$ is convex and nonincreasing in $D$ [24] [23], $D_c(P, Q)$ is the entire interval $[0, \infty)$, except perhaps the single point where $R_1(P,Q,D)$ transitions from finite to infinite.

Somewhat loosely speaking, the main point of this paper is to give conditions under which an infimum 
over Q can be moved inside the limit in the above theorem. It turns out that our method of proof works 
equally well for moving an arg-infimum inside the limit. The next theorem, proved in Section V, is a strong consistency result giving conditions under which the approximate minimizers $(\theta_n)$ converge to the optimal parameters $\{\theta^*\}$ corresponding to the optimal reproduction distributions $\{Q_{\theta^*}\}$.

Theorem 8: Suppose the source $(X_n)$ is stationary and ergodic with $X_1 \sim P$, the parameter set $\Theta$ is separable, and A1 and A2 hold. Then for all $D \in D_c^\Theta(P)$, the set

$$\Theta^* := \arg\inf_{\theta\in\Theta} R_1(P, Q_\theta, D)$$

is not empty and any (typically random) sequence $(\theta_n)$ of approximate minimizers, i.e., satisfying

$$\limsup_{n\to\infty} R_1(P_{X_1^n}, Q_{\theta_n}, D) \le \limsup_{n\to\infty} R_1^\Theta(P_{X_1^n}, D), \tag{9}$$

has all of its limit points in $\Theta^*$ with probability 1. Furthermore, if $R_1^\Theta(P,D) < \infty$ and either P2 or N2 holds, then any sequence of approximate minimizers $(\theta_n)$ is relatively compact with probability 1. Hence, $\theta_n \to \Theta^*$ with probability 1.
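
To make the idea of approximate minimizers concrete, here is a minimal Python sketch for a binary source with Hamming distortion and the family of Bernoulli($\theta$) reproduction distributions. It evaluates $R_1(P_{x_1^n}, Q_\theta, D)$ through the supremum representation over $\lambda \le 0$ used in the proofs, approximated on a finite grid, and takes the arg-min over a finite grid of $\theta$. The grids, the sample size, the seed, and the names `r1_binary` and `theta_n` are assumptions of this example, not part of the paper; a grid search is only a stand-in for the arg-inf.

```python
import math
import random

def r1_binary(p1, q1, D, lam_grid):
    """Grid approximation of sup_{lam<=0} [lam*D - E_P log E_Q exp(lam*rho)]
    for binary alphabets with Hamming distortion; p1 = P(1), q1 = Q(1)."""
    best = 0.0  # the bracket equals 0 at lam = 0
    for lam in lam_grid:
        e = math.exp(lam)
        # Under Hamming distortion, E_Q exp(lam*rho(x, Y)) = Q(x) + (1 - Q(x)) * e^lam.
        log_mgf = ((1 - p1) * math.log((1 - q1) + q1 * e)
                   + p1 * math.log(q1 + (1 - q1) * e))
        best = max(best, lam * D - log_mgf)
    return best

rng = random.Random(0)                      # fixed seed, for reproducibility
n, p_true, D = 20000, 0.3, 0.1              # illustrative values
x = [1 if rng.random() < p_true else 0 for _ in range(n)]
p_hat = sum(x) / n                          # empirical distribution P_{x_1^n}

lam_grid = [-0.01 * i for i in range(1, 1001)]   # lam in [-10, 0)
theta_grid = [0.01 * j for j in range(1, 100)]   # candidate Bernoulli(theta)
theta_n = min(theta_grid, key=lambda t: r1_binary(p_hat, t, D, lam_grid))
```

For a Bernoulli($p$) source with $D < \min(p, 1-p)$, the classical optimal reproduction distribution is Bernoulli($(p-D)/(1-2D)$), which equals $0.25$ for the values above, so `theta_n` should land near that point for large $n$.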



B. More General Estimators 

The upper and lower bounds of Theorems 4 and 5 can be combined to extend our results to a variety of estimators besides the ones considered already. For example, instead of the simple plug-in estimator,

$$R_1^\Theta(P_{x_1^n}, D) = \inf_{\theta\in\Theta} R_1(P_{x_1^n}, Q_\theta, D),$$

we may wish to consider MDL-style penalized estimators, of the form

$$\inf_{\theta\in\Theta} \left\{ R_1(P_{x_1^n}, Q_\theta, D) + F_n(\theta) \right\}, \tag{10}$$

for appropriate (nonnegative) penalty functions $F_n(\theta)$. The penalty functions express our preference for certain (typically less complex) subsets of $\Theta$ over others. This issue is, of course, particularly important when estimating the optimal reproduction distribution as discussed in the previous section. Note that in the case when no distortion is allowed, these estimators reduce to the classical ones used in lossless data compression and in MDL-based model selection [13]. Indeed, if $A = \hat{A}$ are discrete sets, $\rho$ is Hamming distance and $D = 0$, then the estimator in (10) becomes

$$-\frac{1}{n} \sup_{\theta\in\Theta} \left\{ \log Q_\theta^n(x_1^n) - n F_n(\theta) \right\},$$

which is precisely the general form of a penalized maximum likelihood estimator. [As usual, $Q^n$ denotes the $n$-fold product distribution on $\hat{A}^n$ corresponding to the marginal distribution $Q$.]
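
The lossless special case can be sketched in a few lines of Python. With Hamming distortion and $D = 0$, the criterion for an i.i.d. model $Q$ is the normalized negative log-likelihood $-\frac{1}{n}\log Q^n(x_1^n)$, which coincides with the cross-entropy of $Q$ under the empirical distribution. The two-model family, the penalty $F_n(\theta) = k_\theta \log n / (2n)$ (with $k_\theta$ a hypothetical parameter count), and all function names below are assumptions chosen for illustration; the paper does not prescribe a particular penalty.

```python
import math
from collections import Counter

def code_length_rate(x, Q):
    """-(1/n) log Q^n(x_1^n) for an i.i.d. model Q given as a dict of probabilities."""
    return -sum(math.log(Q[s]) for s in x) / len(x)

def cross_entropy(x, Q):
    """sum_a P_x(a) log(1/Q(a)), where P_x is the empirical distribution of x."""
    n = len(x)
    return -sum((c / n) * math.log(Q[a]) for a, c in Counter(x).items())

def penalized_ml(x, family, penalty):
    """Name of the model minimizing -(1/n) log Q^n(x) + F_n(theta)."""
    n = len(x)
    return min(family,
               key=lambda name: code_length_rate(x, family[name]) + penalty(name, n))

# Hypothetical two-model family and MDL-style penalty (assumptions of this sketch).
family = {"uniform": {"a": 0.5, "b": 0.5}, "skewed": {"a": 0.75, "b": 0.25}}
free_params = {"uniform": 0, "skewed": 1}

def penalty(name, n):
    """F_n(theta) = k * log(n) / (2n); it tends to 0 as n grows."""
    return free_params[name] * math.log(n) / (2 * n)

x = list("aaabaaab" * 4)        # toy data: 24 a's and 8 b's
chosen = penalized_ml(x, family, penalty)
```

On this toy string the skewed model matches the empirical frequencies exactly, and its likelihood gain outweighs the penalty, so it is the one selected.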

More generally, suppose we have a sequence of functions $(\varphi_n(x_1^n, \theta, D))$ with the properties that

$$\varphi_n(x_1^n, \theta, D) \ge R_1(P_{x_1^n}, Q_\theta, D) \tag{11a}$$

$$\limsup_{n\to\infty} \varphi_n(X_1^n, \theta, D) \overset{\text{w.p.1}}{\le} \limsup_{n\to\infty} R_1(P_{X_1^n}, Q_\theta, D) \tag{11b}$$

for all $n$, $x_1^n$, $\theta$ and $D$. For each such sequence of functions $(\varphi_n)$, we define a new estimator for $R_1^\Theta(P, D)$ by

$$\varphi_n^\Theta(x_1^n, D) := \inf_{\theta\in\Theta} \varphi_n(x_1^n, \theta, D).$$

Condition (11a) implies that any lower bound for the plug-in estimator also holds here. Also, by considering a single $\theta'$ for which

$$\limsup_{n\to\infty} R_1(P_{X_1^n}, Q_{\theta'}, D) \overset{\text{w.p.1}}{\le} R_1^\Theta(P, D) + \epsilon,$$

we see that (11b) similarly implies a corresponding upper bound. We thus obtain:

Corollary 9: Theorems 5 and 8 remain valid if $R_1^\Theta(P_{X_1^n}, D)$ is replaced by $\varphi_n^\Theta(X_1^n, D)$ for any sequence of functions $(\varphi_n)$ satisfying (11a) and (11b).

For example, the penalized plug-in estimators above satisfy the conditions of the corollary, as long as the penalty functions $F_n$ satisfy, for each $\theta$, $F_n(\theta) \to 0$ as $n \to \infty$.

Another example is the sequence of estimators based on the "lossy likelihoods" of [21], namely,

$$\varphi_n(x_1^n, \theta, D) = -\frac{1}{n} \log Q_\theta^n(B_n(x_1^n, D)),$$

where $B_n(x_1^n, D)$ denotes the distortion-ball of radius $D$ centered at $x_1^n$,

$$B_n(x_1^n, D) := \left\{ y_1^n \in \hat{A}^n : \frac{1}{n} \sum_{k=1}^n \rho(x_k, y_k) \le D \right\},$$

cf. [14]. Again, both conditions (11a) and (11b) are valid in this case [24] [23].

C. Nonstationary Sources 

As mentioned in the introduction, part of our motivation comes from considering the problem of 
estimating the rate-distortion function of distributions P which cannot be computed analytically, but 
which can be easily simulated by MCMC algorithms, as is very often the case in image processing, for 
example. Of course, MCMC samples are typically not stationary. However, the distribution of the entire 
sequence of MCMC samples is dominated by (i.e., is absolutely continuous with respect to) a stationary 
and ergodic distribution, namely, the distribution of the same Markov chain started from its stationary 
distribution, which is of course the target distribution P. Therefore, all of our results remain valid: Results 
that hold with probability 1 in the stationary case necessarily hold with probability 1 in the nonstationary 
case. The only minor technicality is that the initial distribution of the MCMC chain needs to be absolutely 
continuous with respect to P. 
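
The following Python sketch illustrates this point on the simplest possible case: a two-state Markov chain started from a fixed state rather than from its stationary distribution. The chain stands in for an MCMC sampler; the transition probabilities, the seed, and the function name `run_chain` are assumptions of this example. Empirical frequencies along the nonstationary path still approach the stationary probabilities.

```python
import random

def run_chain(n, start, p01, p10, seed=0):
    """Simulate n steps of a two-state Markov chain on {0, 1}.
    p01 = Prob(0 -> 1), p10 = Prob(1 -> 0); the chain starts at `start`."""
    rng = random.Random(seed)
    state = start
    path = []
    for _ in range(n):
        path.append(state)
        u = rng.random()
        if state == 0:
            state = 1 if u < p01 else 0
        else:
            state = 0 if u < p10 else 1
    return path

p01, p10 = 0.1, 0.2
stationary_prob_1 = p01 / (p01 + p10)      # stationary probability of state 1
path = run_chain(200000, start=0, p01=p01, p10=p10)
empirical_prob_1 = sum(path) / len(path)   # empirical frequency of state 1
```

The starting point mass at state 0 is absolutely continuous with respect to the stationary distribution here because that distribution gives state 0 positive probability.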

More generally (for non-Markov sources), the requirements of stationarity and ergodicity are more 
restrictive than necessary. An inspection of the proofs (both here and in the proof of Theorem [7] in [24] 
[23]), reveals that we only need the source to have the following law-of-large-numbers property: 

LLN. There exists a random variable $X$ taking values in the source alphabet $A$, such that

$$\frac{1}{n} \sum_{k=1}^n f(X_k) \xrightarrow{\text{w.p.1}} E[f(X)],$$

for every nonnegative measurable function $f$.

Theorem 10: Theorems 4, 5 and 8, Corollary 9 and the alternative conditions for A2 remain valid if, instead of being stationary and ergodic with $X_1 \sim P$, the source merely satisfies the LLN property for some random variable $X \sim P$. If the distortion measure $\rho$ is bounded, then the LLN property need only hold for bounded, measurable $f$.
Every stationary and ergodic source satisfies this LLN property, as does any source whose distribution is dominated by the distribution of a stationary and ergodic source. This LLN property is somewhat different from the requirement that the source be asymptotically mean stationary (a.m.s.) with an ergodic mean stationary distribution [17]. The latter is a stronger assumption in the sense that $f$ can depend on the entire future of the process, i.e., $n^{-1} \sum_{k=1}^n f(X_k, X_{k+1}, \ldots) \to E[f(X_1^\infty)]$, where $X_1^\infty$ is now a random variable on the infinite sequence space. It is a weaker assumption in that this convergence need only hold for bounded $f$. The final statement of Theorem 10 implies that our consistency results hold for a.m.s. sources (with ergodic mean stationary distributions) as long as the distortion measure is bounded.



V. Proofs 

We frequently use the alternative representation [24] [23] 

$$R_1(P, Q, D) = \sup_{\lambda \le 0} \left[ \lambda D - E_{X\sim P}\left[ \log E_{Y\sim Q}\, e^{\lambda \rho(X,Y)} \right] \right], \tag{12}$$



which is valid for all choices of P, Q and D. 
This representation makes it easy to prove that 

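$$R_1^\Theta(\epsilon P' + (1-\epsilon)P, D) \ge \epsilon R_1^\Theta(P', D/\epsilon) \tag{13}$$

for $\epsilon \in (0, 1)$, which is used above in Example III-E. Indeed,

$$\begin{aligned}
R_1(\epsilon P' + (1-\epsilon)P, Q_\theta, D)
&= \sup_{\lambda\le 0} \Big[ \lambda D - \epsilon E_{X\sim P'}\big[\log E_{Y\sim Q_\theta} e^{\lambda\rho(X,Y)}\big] - (1-\epsilon) E_{X\sim P}\big[\log E_{Y\sim Q_\theta} e^{\lambda\rho(X,Y)}\big] \Big] \\
&\ge \sup_{\lambda\le 0} \Big[ \lambda D - \epsilon E_{X\sim P'}\big[\log E_{Y\sim Q_\theta} e^{\lambda\rho(X,Y)}\big] \Big] \\
&= \epsilon \sup_{\lambda\le 0} \Big[ \lambda D/\epsilon - E_{X\sim P'}\big[\log E_{Y\sim Q_\theta} e^{\lambda\rho(X,Y)}\big] \Big] \\
&= \epsilon R_1(P', Q_\theta, D/\epsilon).
\end{aligned}$$

The inequality holds because $\log E_{Y\sim Q_\theta} e^{\lambda\rho(X,Y)} \le 0$ for $\lambda \le 0$, so the dropped term is nonnegative. Taking the infimum over $\theta \in \Theta$ on both sides gives (13).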


A. Measurability 

Here we discuss the various measurability assumptions that are used throughout the paper. Note that 
we do not always establish the measurability of an event if it contains another measurable event that has 
probability 1. 

Since $\rho$ is product measurable, $x \mapsto E_\theta[e^{\lambda\rho(x,Y)}]$ is measurable. This implies that $x_1^n \mapsto \lambda D - E_{P_{x_1^n}}\big[\log E_\theta[e^{\lambda\rho(X,Y)}]\big]$ is measurable. Since this is concave in $\lambda$ [23], we can evaluate the supremum over all $\lambda \le 0$ in (12) by considering only countably many $\lambda \le 0$, which means that $x_1^n \mapsto R_1(P_{x_1^n}, Q_\theta, D)$ is measurable.

If $\Theta$ is a separable metric space and $f : \Theta \times A^n \to \mathbb{R}$ is measurable for fixed $\theta \in \Theta$ and continuous for fixed $x_1^n \in A^n$, then $x_1^n \mapsto \sup_{\theta\in U} f(\theta, x_1^n)$ is measurable for any subset $U \subseteq \Theta$. This is because $\sup_{\theta\in U} f = \sup_{\theta\in U'} f$ for any (at most) countable dense subset $U' \subseteq U$, and the latter is measurable because $U'$ is (at most) countable. Since $\Theta$ is separable, such a $U'$ always exists, and since $f(\cdot, x_1^n)$ is continuous, $U'$ can be chosen independently of $x_1^n$. An identical argument holds for $\inf_{\theta\in U} f$. We make use of this frequently in the lower bound, where the necessary continuity comes from A1.
B. Proof of Theorem 4

The upper bound in Theorem 4 is deduced from Theorem 7 as follows. If $D = 0$ or $R_1^\Theta(P,D) = \infty$, then choose $D' = D$; otherwise, choose $D' < D$ such that $R_1^\Theta(P, D') \le R_1^\Theta(P, D) + \epsilon/2$. We can always do this since $D \in D_c^\Theta(P)$. Now pick $\theta \in \Theta$ with $R_1(P,Q_\theta,D') \le R_1^\Theta(P,D') + \epsilon/2$. This ensures that $D \in D_c(P,Q_\theta)$ and Theorem 7 gives

$$\limsup_{n\to\infty} R_1^\Theta(P_{X_1^n}, D) \le \limsup_{n\to\infty} R_1(P_{X_1^n}, Q_\theta, D) \overset{\text{w.p.1}}{=} R_1(P,Q_\theta,D) \le R_1(P,Q_\theta,D') \le R_1^\Theta(P,D) + \epsilon,$$

completing the proof. Notice that if we switch the lim sup to a liminf, we can remove any restrictions on $D$ since there are no restrictions in this case in Theorem 7. This gives the liminf form of the upper bound in Theorem 4.



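C. Proof of Theorem 5

Here we prove the lower bound of Theorem 5. Let $\tau$ denote the metric on $\Theta$ and let $O(\theta, \epsilon) := \{\theta' : \tau(\theta', \theta) < \epsilon\}$ denote the open ball of radius $\epsilon$ centered at $\theta$. The main goal is to prove that

$$\lim_{\epsilon\downarrow 0} \liminf_{n\to\infty} \inf_{\theta'\in O(\theta,\epsilon)} R_1(P_{X_1^n}, Q_{\theta'}, D) \overset{\text{w.p.1}}{\ge} R_1(P, Q_\theta, D) \tag{14}$$

for all $\theta \in \Theta$ simultaneously, that is, the exceptional set can be chosen independently of $\theta$. To see how this gives the lower bound, first choose a sequence $(\theta_n)$ according to A2 and a subsequence $(n_k)$ along which $\liminf_n R_1(P_{X_1^n}, Q_{\theta_n}, D)$ is actually a limit. Let $\theta^*$ be a limit point of the subsequence $(\theta_{n_k})$. Note that such a $\theta^*$ exists with probability 1 by assumption A2 and that it depends on $X_1^\infty$. We have,

$$\begin{aligned}
\liminf_{n\to\infty} R_1^\Theta(P_{X_1^n}, D)
&\ge \liminf_{n\to\infty} R_1(P_{X_1^n}, Q_{\theta_n}, D) = \lim_{k\to\infty} R_1(P_{X_1^{n_k}}, Q_{\theta_{n_k}}, D) \\
&\overset{\text{w.p.1}}{\ge} \liminf_{n\to\infty} \inf_{\theta'\in O(\theta^*,\epsilon)} R_1(P_{X_1^n}, Q_{\theta'}, D)
\end{aligned} \tag{15}$$

for each $\epsilon > 0$. The first inequality is from A2 and the last is valid because infinitely many elements of $(\theta_{n_k})$ are in $O(\theta^*, \epsilon)$ for any $\epsilon > 0$. Letting $\epsilon \downarrow 0$ in (15) and using (14) gives

$$\liminf_{n\to\infty} R_1^\Theta(P_{X_1^n}, D) \overset{\text{w.p.1}}{\ge} R_1(P, Q_{\theta^*}, D) \ge R_1^\Theta(P, D)$$

as desired. Note that, together with the liminf form of the upper bound in Theorem 4, this also implies that $\theta^*$ achieves the infimum in the definition of $R_1^\Theta(P,D)$.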

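We need only prove (14). For any $\lambda \le 0$, $\theta \in \Theta$ and $\epsilon > 0$, the pointwise ergodic theorem gives

$$\lim_{n\to\infty} \frac{1}{n} \sum_{k=1}^n \sup_{\theta'\in O(\theta,\epsilon)} \log E_{\theta'}[e^{\lambda\rho(X_k,Y)}] \overset{\text{w.p.1}}{=} E_P\Big[ \sup_{\theta'\in O(\theta,\epsilon)} \log E_{\theta'}[e^{\lambda\rho(X,Y)}] \Big]. \tag{16}$$

(See Section V-A for measurability.) Fix an at most countable, dense subset $\tilde\Theta \subseteq \Theta$. We can choose the exceptional sets in (16) independently of $\theta \in \tilde\Theta$ and $\epsilon > 0$ rational. For any $\theta \in \Theta$ and $\epsilon > 0$ we can choose a $\tilde\theta \in \tilde\Theta$ and a rational $\tilde\epsilon > \epsilon$ such that $O(\theta, \epsilon) \subseteq O(\tilde\theta, \tilde\epsilon) \subseteq O(\theta, 2\epsilon)$. Since the exceptional sets in (16) do not depend on $\tilde\theta$ and $\tilde\epsilon$, we have that

$$\begin{aligned}
\limsup_{n\to\infty} \sup_{\theta'\in O(\theta,\epsilon)} \frac{1}{n} \sum_{k=1}^n \log E_{\theta'}[e^{\lambda\rho(X_k,Y)}]
&\le \limsup_{n\to\infty} \frac{1}{n} \sum_{k=1}^n \sup_{\theta'\in O(\theta,\epsilon)} \log E_{\theta'}[e^{\lambda\rho(X_k,Y)}] \\
&\le \limsup_{n\to\infty} \frac{1}{n} \sum_{k=1}^n \sup_{\theta'\in O(\tilde\theta,\tilde\epsilon)} \log E_{\theta'}[e^{\lambda\rho(X_k,Y)}] \\
&\overset{\text{w.p.1}}{=} E_P\Big[ \sup_{\theta'\in O(\tilde\theta,\tilde\epsilon)} \log E_{\theta'}[e^{\lambda\rho(X,Y)}] \Big] \\
&\le E_P\Big[ \sup_{\theta'\in O(\theta,2\epsilon)} \log E_{\theta'}[e^{\lambda\rho(X,Y)}] \Big]
\end{aligned} \tag{17}$$

simultaneously for all $\theta \in \Theta$ and $\epsilon > 0$, that is, the exceptional set can be chosen independently of $\theta$ and $\epsilon$.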



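The monotone convergence theorem and the continuity in A1 give

$$\lim_{\epsilon\downarrow 0} E_P\Big[ \sup_{\theta'\in O(\theta,2\epsilon)} \log E_{\theta'}[e^{\lambda\rho(X,Y)}] \Big] = E_P\Big[ \lim_{\epsilon\downarrow 0} \sup_{\theta'\in O(\theta,2\epsilon)} \log E_{\theta'}[e^{\lambda\rho(X,Y)}] \Big] = E_P\big[ \log E_\theta[e^{\lambda\rho(X,Y)}] \big].$$

Combining this with (17) and letting $\epsilon \downarrow 0$ gives

$$\lim_{\epsilon\downarrow 0} \limsup_{n\to\infty} \sup_{\theta'\in O(\theta,\epsilon)} \frac{1}{n} \sum_{k=1}^n \log E_{\theta'}[e^{\lambda\rho(X_k,Y)}] \overset{\text{w.p.1}}{\le} E_P\big[ \log E_\theta[e^{\lambda\rho(X,Y)}] \big] \tag{18}$$

simultaneously for all $\theta \in \Theta$.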

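Both sides of (18) are nondecreasing with $\lambda$. Furthermore, the right side of (18) is continuous from above for $\lambda < 0$. (To see this, use the dominated convergence theorem to move the limit through $E_\theta$ and the monotone convergence theorem to move the limit through $E_P$.) These two facts imply that we can also choose the exceptional sets independently of $\lambda \le 0$ (by first applying (18) for $\lambda$ rational and then squeezing). Applying (18) to the representation in (12) gives, for each $\lambda \le 0$,

$$\begin{aligned}
\lim_{\epsilon\downarrow 0} \liminf_{n\to\infty} \inf_{\theta'\in O(\theta,\epsilon)} R_1(P_{X_1^n}, Q_{\theta'}, D)
&\ge \lim_{\epsilon\downarrow 0} \liminf_{n\to\infty} \inf_{\theta'\in O(\theta,\epsilon)} \Big[ \lambda D - \frac{1}{n} \sum_{k=1}^n \log E_{\theta'}[e^{\lambda\rho(X_k,Y)}] \Big] \\
&= \lambda D - \lim_{\epsilon\downarrow 0} \limsup_{n\to\infty} \sup_{\theta'\in O(\theta,\epsilon)} \frac{1}{n} \sum_{k=1}^n \log E_{\theta'}[e^{\lambda\rho(X_k,Y)}] \\
&\overset{\text{w.p.1}}{\ge} \lambda D - E_P\big[ \log E_\theta[e^{\lambda\rho(X,Y)}] \big]
\end{aligned}$$

simultaneously for all $\theta \in \Theta$ and $\lambda \le 0$. Optimizing over $\lambda \le 0$ on the right gives (14).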



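D. Alternative Assumptions

Here we discuss the various alternative assumptions that imply A1 and A2. P1 implies A1 because $y \mapsto e^{\lambda\rho(x,y)}$ is bounded and measurable for each $x \in A$ and $\lambda \le 0$. N1 implies A1 because $y \mapsto e^{\lambda\rho(x,y)}$ is bounded and continuous for each $x \in A$ and $\lambda \le 0$.

1) P2 Implies A2: Here we prove that P2 implies A2 when $(X_n)$ is stationary and ergodic with $X_1 \sim P$. Fix $D$, $\Delta$ and $K$ according to P2, so that $T_\epsilon := \{\theta : Q_\theta(B(K, D + \Delta)) > \epsilon\}$ is relatively compact for each $\epsilon > 0$. We will first show that

$$\lim_{\epsilon\downarrow 0} \liminf_{n\to\infty} \inf_{\theta\in T_\epsilon^c} R_1(P_{X_1^n}, Q_\theta, D) \overset{\text{w.p.1}}{=} \infty \tag{19}$$

where $T_\epsilon^c$ denotes the complement of $T_\epsilon$. If $T_\epsilon^c$ is empty for some $\epsilon > 0$, then (19) follows from the convention that $\inf \emptyset = \infty$. We can thus focus on the case where $T_\epsilon^c$ is not empty for all $\epsilon > 0$. Define $\lambda_\epsilon := (\log \epsilon)/(D + \Delta)$. Since

$$\rho(x, y) \ge (D + \Delta)\, \mathbf{1}\{x \in K,\, y \in B(K, D + \Delta)^c\}$$

we have for any $\theta \in T_\epsilon^c$

$$\log E_\theta[e^{\lambda_\epsilon \rho(x,Y)}] \le \mathbf{1}\{x \in K\} \log\big[\epsilon + e^{\lambda_\epsilon(D+\Delta)}\big] = \mathbf{1}\{x \in K\} \log(2\epsilon).$$

This and the representation in (12) imply that

$$\begin{aligned}
\inf_{\theta\in T_\epsilon^c} R_1(P_{X_1^n}, Q_\theta, D)
&\ge \inf_{\theta\in T_\epsilon^c} \Big[ \lambda_\epsilon D - \frac{1}{n} \sum_{k=1}^n \log E_\theta[e^{\lambda_\epsilon \rho(X_k,Y)}] \Big] \\
&\ge \frac{D}{D+\Delta} \log \epsilon - \frac{1}{n} \sum_{k=1}^n \mathbf{1}\{X_k \in K\} \log(2\epsilon).
\end{aligned}$$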
Taking limits, the pointwise ergodic theorem gives

$$\liminf_{n\to\infty} \inf_{\theta\in T_\epsilon^c} R_1(P_{X_1^n}, Q_\theta, D) \overset{\text{w.p.1}}{\ge} \frac{D}{D+\Delta} \log \epsilon - P(K) \log(2\epsilon). \tag{20}$$

Letting $\epsilon \downarrow 0$ ($\epsilon$ rational) and noting that $P(K) > D/(D + \Delta)$ by assumption gives (19).

Now we will show that (19) implies A2. Fix a realization $x_1^\infty$ of $X_1^\infty$ for which (19) holds. Let $(n_k)$ be a subsequence for which

$$L := \liminf_{n\to\infty} R_1^\Theta(P_{x_1^n}, D) = \lim_{k\to\infty} R_1^\Theta(P_{x_1^{n_k}}, D).$$

If $L = \infty$, we can simply take $\theta_n = \theta$ for any constant $\theta$ and all $n$. If $L < \infty$, choose $\theta_{n_k}$ so that

$$\lim_{k\to\infty} R_1(P_{x_1^{n_k}}, Q_{\theta_{n_k}}, D) = L.$$

Then (19) implies that there exists an $\epsilon > 0$ for which $\theta_{n_k}$ must be in $T_\epsilon$ for all $k$ large enough. Since $T_\epsilon$ has compact closure, the subsequence $(\theta_{n_k})$ is relatively compact and it can always be embedded in a relatively compact sequence $(\theta_n)$. Since $x_1^\infty$ is (with probability 1) arbitrary, the proof is complete.

2) N2 Implies A2: Here we prove that N2 implies A2 when $(X_n)$ is stationary and ergodic with $X_1 \sim P$. For each $\epsilon > 0$ and each $M > 0$, let $K(\epsilon, M)$ be the set in N2. The pointwise ergodic theorem gives

$$\lim_{n\to\infty} P_{X_1^n}(K(\epsilon, M)) \overset{\text{w.p.1}}{=} P(K(\epsilon, M)). \tag{21}$$

Fix a realization $x_1^\infty$ of $X_1^\infty$ for which (21) holds for all rational $\epsilon$ and $M$. Let $(n_k)$ be a subsequence for which

$$L := \liminf_{n\to\infty} R_1^\Theta(P_{x_1^n}, D) = \lim_{k\to\infty} R_1^\Theta(P_{x_1^{n_k}}, D).$$

If $L = \infty$, we can simply take $\theta_n = \theta$ for any constant $\theta$ and all $n$. If $L < \infty$, for $k$ large enough both sides are finite and we can choose $W_k \in W(P_{x_1^{n_k}}, D)$ so that

$$H(W_k \| W_k^{A} \times W_k^{\hat{A}}) \le R_1^\Theta(P_{x_1^{n_k}}, D) + 1/k.$$

Let $Q_{\theta_{n_k}} = W_k^{\hat{A}}$ and note that

$$R_1(P_{x_1^{n_k}}, Q_{\theta_{n_k}}, D) \le R_1^\Theta(P_{x_1^{n_k}}, D) + 1/k.$$

We will show that $(\theta_{n_k})$ is relatively compact by showing that the sequence $(Q_k := Q_{\theta_{n_k}})$ is tight.$^{10}$ This will complete the proof just like in the previous section.

Fix $\epsilon > 0$ rational and $M > 2D/\epsilon$ rational. Let $K = K(\epsilon/2, M)$. We have

$$D \ge E_{(U,V)\sim W_k}[\rho(U,V)] \ge M\, W_k(K \times B(K,M)^c) \ge 2D\, W_k(K \times B(K,M)^c)/\epsilon.$$

This implies that $W_k(K \times B(K,M)^c) \le \epsilon/2$ and we can bound

$$Q_k(B(K,M)) = W_k^{\hat{A}}(B(K,M)) \ge W_k(K \times B(K,M)) = P_{x_1^{n_k}}(K) - W_k(K \times B(K,M)^c) \ge P_{x_1^{n_k}}(K) - \epsilon/2.$$

Taking limits and applying (21) gives

$$\liminf_{k\to\infty} Q_k(B(K,M)) \ge P(K) - \epsilon/2 > 1 - \epsilon.$$

Since $B(K, M)$ has compact closure and since $\epsilon$ was arbitrary, the sequence $(Q_k)$ is tight.

10 A sequence of probability measures $(Q_k)$ on $(\hat{A}, \hat{\mathcal{A}})$ is said to be tight if $\sup_F \liminf_{k\to\infty} Q_k(F) = 1$, where the supremum is over all compact (measurable) $F \subseteq \hat{A}$. If $(Q_k)$ is tight, then Prohorov's Theorem states that it is relatively compact in the topology of weak convergence of probability measures [25].
E. Proof of Theorem 8

Here we prove the convergence-of-minimizers result given in Theorem 8. The proof of Theorem 5 in Section V-C shows that $\Theta^*$ is not empty. The assumptions ensure that both the lower and upper bounds for consistency of the plug-in estimator hold, so that $R_1^\Theta(P_{X_1^n}, D) \to R_1^\Theta(P, D)$ with probability 1. This shows that any sequence $(\theta_n)$ satisfying (9) also satisfies the inequality in A2 with probability 1, and that the limsup and the liminf agree. Let $\theta^*$ be any limit point of this sequence (if one exists). Following the steps at the beginning of the proof of Theorem 5 in Section V-C, we see that $\theta^* \in \Theta^*$.

Now further suppose that $R_1^\Theta(P,D)$ is finite so that

$$R_1(P_{X_1^n}, Q_{\theta_n}, D) \xrightarrow{\text{w.p.1}} R_1^\Theta(P, D) < M < \infty. \tag{22}$$

We want to show that the sequence $(\theta_n)$ is relatively compact with probability 1. If P2 holds, then (19) immediately implies that there exists an $\epsilon > 0$ such that $\theta_n \in T_\epsilon$ eventually, with probability 1. Since $T_\epsilon$ is relatively compact, so is $(\theta_n)$.

Alternatively, suppose N2 holds. To show that $(\theta_n)$ is relatively compact with probability 1, we need only show that $(Q_{\theta_n})$ is tight w.p.1. Fix a realization where the convergence in (22) holds, where (21) holds for all rational $\epsilon$ and $M$, and where $R_1^\Theta(P_{x_1^n}, D) \to R_1^\Theta(P, D)$. For $n$ large enough, the left side of (22) is finite, so $W(P_{x_1^n}, D)$ is not empty and we can choose a sequence $(W_n)$ with $W_n \in W(P_{x_1^n}, D)$ so that

$$H(W_n \| P_{x_1^n} \times Q_{\theta_n}) \to R_1^\Theta(P, D).$$

Let $Q_n := W_n^{\hat{A}}$. An inspection of the above proof that N2 implies A2 shows that the sequence $(Q_n)$ is tight. We will show that $H(Q_n \| Q_{\theta_n}) \to 0$, implying that $(Q_{\theta_n})$ is also tight (because, for example,
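relative entropy bounds total variation distance). Indeed,

$$\underbrace{H(W_n \| P_{x_1^n} \times Q_{\theta_n})}_{a_n} = \underbrace{H(W_n \| P_{x_1^n} \times Q_n)}_{b_n} + \underbrace{H(Q_n \| Q_{\theta_n})}_{c_n}.$$

Since $a_n$ and $b_n$ both converge to $R_1^\Theta(P, D)$, which is finite, $c_n \to 0$, as claimed.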



F. Proof of Theorem 10

Here we prove the result of Theorem 10 based on the law-of-large-numbers property. Inspecting all of the proofs in this paper reveals that the assumption of a stationary and ergodic source is only used to invoke the pointwise ergodic theorem. Furthermore, the pointwise ergodic theorem is not needed in full generality; only the LLN property is used. The relevant equations are (16), (20) and (21). Note that if $\rho$ is bounded, then it is enough to have the LLN property hold for bounded $f$.

Equation (8) from Theorem 7, which we used in the proof of the upper bound, also assumes a stationary and ergodic source. The proof of a more general result than Theorem 7 is in [24] [23], but that result makes extensive use of the stationarity assumption. A careful reading reveals that only the LLN property is needed for (8). For completeness, we will give a proof, referring only to [23] for results that do not depend on the nature of the source. Specifically, what we need to prove for the upper bound is that

$$\limsup_{n\to\infty} R_1(P_{X_1^n}, Q, D) \overset{\text{w.p.1}}{\le} R_1(P, Q, D) \tag{23}$$

for all $D \in D_c(P,Q)$.

If the source satisfies the LLN property for a random variable $X$ with distribution $P$, then

$$\lim_{n\to\infty} \frac{1}{n} \sum_{k=1}^n \log E_{Y\sim Q}[e^{\lambda\rho(X_k,Y)}] \overset{\text{w.p.1}}{=} E_{X\sim P}\big[ \log E_{Y\sim Q}[e^{\lambda\rho(X,Y)}] \big] := \Lambda(\lambda). \tag{24}$$
Furthermore, since both sides are monotone in $\lambda$, the exceptional sets can be chosen independently of $\lambda$. The LLN property also implies that

$$\lim_{n\to\infty} \frac{1}{n} \sum_{k=1}^n E_{Y\sim Q}[\rho(X_k, Y)] \overset{\text{w.p.1}}{=} E_{X\sim P}\big[ E_{Y\sim Q}[\rho(X, Y)] \big] := D_{\mathrm{ave}}. \tag{25}$$

Note that if $\rho$ is bounded, then the LLN property need only hold for bounded $f$ in both (24) and (25).

Define $\Lambda^*(D) := \sup_{\lambda\le 0} [\lambda D - \Lambda(\lambda)]$ and $D_{\min} := \inf\{D \ge 0 : \Lambda^*(D) < \infty\}$, with the convention that the infimum of the empty set equals $+\infty$. In [23] it is shown that $D_{\min} \le D_{\mathrm{ave}}$, that $\Lambda^*$ is convex, nonincreasing and continuous from the right, and that

$$\Lambda^*(D) = \begin{cases} \infty & \text{if } D < D_{\min} \\ \text{strictly convex} & \text{if } D_{\min} < D < D_{\mathrm{ave}} \\ 0 & \text{if } D \ge D_{\mathrm{ave}} \end{cases}$$

where some of these cases may be empty. Notice that $\Lambda^*$ is continuous except perhaps at $D_{\min}$, where it will not be continuous from the left if $\Lambda^*(D_{\min}) < \infty$.

Fix a realization $x_1^\infty$ of $X_1^\infty$ for which (24) holds for all $\lambda$ and for which (25) holds. Define the random variables

$$Z_n := \frac{1}{n} \sum_{k=1}^n \rho(x_k, Y_k)$$

for $n \ge 1$, where the sequence $(Y_k)$ consists of independent and identically distributed (i.i.d.) random variables with common distribution $Q$. Then (24) implies that

$$\lim_{n\to\infty} \frac{1}{n} \log E[e^{\lambda n Z_n}] = \Lambda(\lambda). \tag{26}$$

We will first show that

$$\lim_{n\to\infty} -\frac{1}{n} \log \mathrm{Prob}\{Z_n \le D\} = \Lambda^*(D) = R_1(P, Q, D) \tag{27}$$

for all $D \ge 0$ except the special case when both $D = D_{\min}$ and $\Lambda^*(D_{\min}) < \infty$. The second equality in (27) is always valid [24] [23]. If $D < D_{\min}$, or $D = D_{\min}$ and $\Lambda^*(D_{\min}) = \infty$, or $D_{\min} < D < D_{\mathrm{ave}}$, the first equality in (27) follows from [23, Lemma 11], which is a slight modification of the Gärtner-Ellis Theorem in the theory of large deviations. The aforementioned properties of $\Lambda^*$ and the convergence in (26) are what we need to use [23, Lemma 11]. If $D > D_{\mathrm{ave}}$, then $\Lambda^*(D) = 0$ and we need only show that $\liminf_n \mathrm{Prob}\{Z_n \le D\} > 0$. But this follows from Chebychev's inequality and (25) because

$$\mathrm{Prob}\{Z_n \le D\} = 1 - \mathrm{Prob}\{Z_n > D\} \ge 1 - E[Z_n]/D \to 1 - D_{\mathrm{ave}}/D > 0.$$

This proves (27), except for the special case when $D = D_{\min}$ and $\Lambda^*(D_{\min}) < \infty$, which exactly corresponds to $D \notin D_c(P, Q)$.

Finally, (27) gives (23) because [24] [23]

$$R_1(P_{x_1^n}, Q, D) \le -\frac{1}{n} \log \mathrm{Prob}\{Z_n \le D\}$$

and because $x_1^\infty$ is (with probability 1) arbitrary.
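
The large-deviations limit used in this proof can be checked numerically in a deliberately simple setting. The Python sketch below assumes a source alphabet with a single symbol, reproduction symbols $Y_k \sim$ Bernoulli($q$), and distortion $\rho(x, y) = y$, so that $n Z_n$ is binomial and $\Lambda^*(D)$ reduces to the binary relative entropy between $D$ and $q$ for $0 < D < q$. These choices, the values of $n$, $q$ and $D$, and the function names are assumptions made for illustration only.

```python
import math

def log_binom_cdf(n, q, k):
    """log Prob{Binomial(n, q) <= k}, computed with log-sum-exp for stability."""
    logs = [math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * math.log(q) + (n - j) * math.log(1 - q)
            for j in range(k + 1)]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

def kl_bernoulli(d, q):
    """Relative entropy between Bernoulli(d) and Bernoulli(q), in nats."""
    return d * math.log(d / q) + (1 - d) * math.log((1 - d) / (1 - q))

n, q, D = 2000, 0.5, 0.2                     # illustrative values with D < q
k = int(math.floor(n * D + 1e-9))            # Z_n <= D  iff  sum of Y_k <= n*D
rate_n = -log_binom_cdf(n, q, k) / n         # -(1/n) log Prob{Z_n <= D}
rate_limit = kl_bernoulli(D, q)              # Lambda*(D) in this toy setting
```

In this toy case `rate_n` sits slightly above `rate_limit` for every finite $n$ (a Chernoff-type bound), and the gap shrinks on the order of $(\log n)/n$.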